Distributed out-of-memory NMF on CPU/GPU architectures
Authors
Abstract
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operation on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density $10^{-6}$.
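The batching idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it assumes the standard Lee-Seung multiplicative updates for the Frobenius objective (the abstract does not name the update rule), runs on the CPU with NumPy, and uses hypothetical names (`batched_nmf`, `batch_rows`, `relative_error`). The point is only that the input matrix is touched one row batch at a time, so a device would need to hold a single batch plus the small factors.

```python
import numpy as np

def batched_nmf(A, k, batch_rows, n_iter=50, eps=1e-9, seed=0):
    """Factorize A (m x n) as W (m x k) @ H (k x n), reading A in row batches.

    Assumption: Lee-Seung multiplicative updates; eps guards against
    division by zero.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    starts = range(0, m, batch_rows)
    for _ in range(n_iter):
        # H update: accumulate the small products W^T A and W^T W per batch.
        WtA = np.zeros((k, n))
        WtW = np.zeros((k, k))
        for s in starts:
            Ab = A[s:s + batch_rows]   # the only slice of A needed right now
            Wb = W[s:s + batch_rows]
            WtA += Wb.T @ Ab
            WtW += Wb.T @ Wb
        H *= WtA / (WtW @ H + eps)
        # W update: each row batch of W depends only on its own batch of A.
        HHt = H @ H.T
        for s in starts:
            Ab = A[s:s + batch_rows]
            Wb = W[s:s + batch_rows]
            W[s:s + batch_rows] = Wb * (Ab @ H.T) / (Wb @ HHt + eps)
    return W, H

def relative_error(A, W, H):
    """Relative Frobenius reconstruction error ||A - WH|| / ||A||."""
    return np.linalg.norm(A - W @ H) / np.linalg.norm(A)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((40, 3)) @ rng.random((3, 30))  # non-negative, rank 3
    W, H = batched_nmf(A, k=3, batch_rows=7, n_iter=200)
    print(relative_error(A, W, H))
```

Because the batch loops only accumulate `k x n` and `k x k` products, the batched result matches the unbatched update up to floating-point rounding. In a GPU setting each batch copy could in principle be overlapped with the previous batch's compute, which is the role the abstract assigns to CUDA streams.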
Similar resources
Performance Modeling of Distributed Memory Architectures
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single source, and multiple source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi-dimensional arrays, and emulation of butterfly networks. We also show ...
Performance Modeling of Multithreaded Distributed Memory Architectures
In multithreaded distributed memory architectures, long-latency memory operations and synchronization delays are tolerated by suspending the current thread and switching to another thread, which is executed concurrently with the long-latency operation of the suspended thread. Timed Petri nets are used to model several multithreaded architectures at the instruction and thread levels. Model eval...
Parallel rendering of volumetric data set on distributed-memory architectures
A solution is proposed to the problem of interactive visualization and rendering of volume data. Designed for parallel distributed memory MIMD architectures, the volume rendering system is based on the ray tracing (RT) visualization technique, the Sticks representation scheme (a data structure exploiting data coherence for the compression of classified datasets), the use of a slice-partitioning...
Computation of Dendrites on Parallel Distributed Memory Architectures
A code for simulating the solidification of a pure material from its undercooled melt based on a phase-field approach has been written for parallel distributed memory architectures using MPI. The numerical scheme is based on finite differences and results in large sparse non-linear systems which are solved by a backtracking line search modification of Newton's method combined with GMRES. Experiments c...
Journal
Journal title: The Journal of Supercomputing
Year: 2023
ISSN: ['0920-8542', '1573-0484']
DOI: https://doi.org/10.1007/s11227-023-05587-4